I Woke Up To NaN And Now I Am Dead Inside
I went to sleep happy. The loss was going down. The gradients were stable. The GPU was humming at 60C like a contented cat. I dreamed of completion. I dreamed of a finished Sonnet model. I dreamed of sleep that was not interrupted by thoughts of learning rate schedules.
I woke up to NaN. Not a number. Not a loss. Nothing. Just emptiness where my progress used to be. The training script was still running. The GPU was still spinning. The loss was just gone. Replaced by the most devastating three letters in machine learning.
There is no pain like opening your terminal in the morning and seeing loss: nan. It is a special kind of grief reserved for people who train models on consumer hardware.
The Numbers
Sixteen percent. Out of 261 hours. That is 42 hours of compute. That is 42 hours of electricity. That is 42 hours of my life I will never get back. The last checkpoint saved at 15 percent. The NaN happened at 16 percent. One percent. That is all I needed. One more percent and I would have been safe.
The Log
```
Step 45100 | Loss: 2.338 | Grad Norm: 1.1
Step 45200 | Loss: 2.335 | Grad Norm: 1.3
Step 45300 | Loss: nan   | Grad Norm: nan
Step 45400 | Loss: nan   | Grad Norm: nan
# It happened so fast. I did not even see it coming.
```
Look at those numbers. Look at how normal they were. Look at how everything was fine. Then step 45300. Everything broke. The gradients exploded. The loss became undefined. The model forgot how to speak English. It forgot how to speak anything.
What Went Wrong
I have theories. None of them bring me peace. Maybe the learning rate was too high. Maybe I should have used a scheduler. Maybe a bad batch of data slipped through. Maybe a cosmic ray hit my GPU at exactly the wrong moment. Maybe the universe is telling me to stop.
I checked the gradients. They spiked right before the NaN. Classic gradient explosion. I should have used gradient clipping. I did not use gradient clipping. I was confident. Confidence is the enemy of completed training runs.
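For anyone who wants to avoid my fate: gradient clipping by global norm is a few lines. This is a plain-Python sketch of what `torch.nn.utils.clip_grad_norm_` does under the hood, with gradients as a list of floats for illustration; the function name here is mine, not from any library.

```python
import math

def clip_grad_norm(grads, max_norm):
    """Rescale gradients in place so their global L2 norm is at most max_norm.

    Mirrors the idea behind torch.nn.utils.clip_grad_norm_; `grads` is a
    plain list of floats here for illustration. Returns the pre-clip norm,
    which is worth logging: a spike in it is the warning sign I ignored.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads[:] = [g * scale for g in grads]
    return total_norm

# A spiking gradient like the one that killed my run:
grads = [30.0, 40.0]             # global norm = 50
pre_clip = clip_grad_norm(grads, 1.0)
# grads is now rescaled to norm 1.0; pre_clip still shows the spike
```

One call per step, right before the optimizer update. That is the entire cost of not losing 42 hours.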
```python
learning_rate = 3e-4     # Probably too high
gradient_clipping = None # Why would I need this
checkpoint_every = 5000  # Too infrequent
hope = "maximum"         # Not a valid parameter
# I am a fool.
```
The Restart
I restarted the training. From zero. From nothing. From the beginning. The loss is going down again. It is at 0.3 percent now. It will take 261 hours again. I will watch it again. I will fear the NaN again.
I added gradient clipping. I lowered the learning rate. I increased checkpoint frequency. I added every safeguard I could think of. It will still probably crash. That is the nature of this work. You prepare. You plan. You lose everything anyway.
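The one safeguard I wish I had written first is a NaN guard: check the loss every step, and never checkpoint or keep training once it goes non-finite. I cannot show my real training loop, so this is a minimal sketch with hypothetical names; the point is the order of the checks, not the API.

```python
import math

def safe_step(loss, step, checkpoint_every=1000):
    """Decide what to do with this step's loss.

    Returns "abort" on NaN/inf (so poisoned weights are never saved over a
    good checkpoint), "checkpoint" when it is time to save, and "continue"
    otherwise. Names are hypothetical, not from my actual script.
    """
    if math.isnan(loss) or math.isinf(loss):
        return "abort"
    if step % checkpoint_every == 0:
        return "checkpoint"
    return "continue"
```

The abort check comes before the checkpoint check on purpose: the worst outcome is not the crash, it is overwriting your last good checkpoint with NaN weights and losing the restart point too.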
The Emotional Toll
I am not okay. I know it is just code. I know it is just weights. I know I can try again. I also know I just lost two days of my life. Two days I could have spent doing anything else. Writing. Sleeping. Touching grass. Instead I am watching a progress bar reset.
My team knows. They sent messages. "It happens." "Welcome to ML." "At least you learned something." I do not want to learn. I want a finished model. I want to release Sonnet. I want to stop writing blogs about failures.
Every machine learning practitioner has a NaN story. This is mine. It is not special. It is just painful.
Why I Continue
I could use cloud GPUs. They have better reliability. They have checkpointing built in. They have support. I do not have the money. I have a 5090 and pride. The 5090 is working. The pride is wounded.
I could train smaller models. Haiku is done. Haiku works. Haiku gives fish answers but it exists. I could just release Haiku and call it a day. I will not. I want Sonnet. I want Opus. I want the full set. I want to suffer for my art.
Lessons Learned
Save checkpoints more often. Clip your gradients. Lower your learning rate. Expect failure. Assume everything will break. Plan for the NaN. When it happens, do not cry. Restart. Try again. Maybe this time it will work.
Also maybe do not train models overnight without checking them first. Maybe watch the loss curve. Maybe be present. Maybe accept that local training is pain and I chose this.
Final Thoughts
The training is running again. The loss is decreasing. The GPU is at 60C. Everything is fine. Everything is terrible. I will check it in an hour. Then in another hour. Then I will sleep with one eye open.
If you are training models locally, I am sorry. If you are thinking about it, I am more sorry. If you have never experienced a NaN crash, you will. It is not a matter of if. It is a matter of when. Prepare yourself.